Introduction and history

My main long-term project is GNU epsilon. It’s a programming
language, meant to be efficient, but:

• very “dynamic” in certain execution phases
• written in itself, bootstrapped

Too slow.

So I wrote a canonical threaded-code VM.
• speedup 4-6x

Too little.

So I made a separate repository to experiment with language VMs.
• tried techniques from scientific papers (many by Anton Ertl
  and the other GForth people)
• added ideas of my own
• it got completely out of hand

A new project, independent from epsilon.
Why you should care

Interpreters are common:
• programming languages
• application scripting
• shells
• regular expressions...

• We are getting used to unacceptably bad performance.

I will present my new software, but first I need to describe the
problem it solves. This will take a while.
Our running example: at first in C

Count down from two billion (here meaning 2 · 10^9):

  int
  main (void)
  {
    long i;
    for (i = 2000000000; i != 0; i --)
      /* Do nothing */;
    return 0;
  }

... does this program really count down?
Our running example: at first in C, now actually counting

Count down from 2 · 10^9 without optimizing away the entire loop:

C (with GNU extensions)

  int
  main (void)
  {
    long i;
    for (i = 2000000000; i != 0; i --)
      asm volatile ("" : : "r" (i)); // pretend to use i
    return 0;
  }

(We still want most GCC optimizations!)

[Demo: the down-counter in a few languages]
You can play with the sources

I will (quickly) show some interpreters written in C.

In case you want to play with the examples yourself, the little
programs I'm showing here are on my server:

http://ageinghacker.net/ghm-2017

These are naive C programs showing how interpreters work; the C
files in c-examples/ are not part of my new project.
How simple interpreters work

The interpreted program is a data structure in memory.
“find the next point in the interpreted program, execute it, repeat
from start”

How to dispatch [“dispatch”: moving from one VM program point to
another]:
• Abstract Syntax Tree (AST) interpreters
• Linear programs
    • switch dispatching
    • direct threading
    • ...

How to access data:
• associative data structures (alists, hash tables)
• VM registers
• stacks
Our down-counter as an Abstract Syntax Tree

  i := 2000000000;
  do
    decrement i;
  while i != 0;

  sequence
    assign
      i
      literal 2000000000
    do-while
      decrement
        i
      != (guard)
        variable i
        literal 0

A program is an Abstract Syntax Tree data structure in memory:
heap-allocated structs and unions with lots of pointers. Each
node has an enum field to distinguish its kind.

[In the original figure: blue: expression node; dashed line: child is a
struct field of parent; black arrow: parent contains pointer to child.]
Abstract Syntax Tree interpreter: expression

As each complex AST has sub-ASTs, recursion is natural. AST data
structures are easy to define in Lisp and ML, a little less pretty in C.

  long
  interpret_expr (const struct expr *e, const long *vars) {
    switch (e->expr_case) {
    case expr_variable:
      return vars [e->var_index];
    case expr_constant:
      return e->cnst;
    case expr_is_different:
      return (   interpret_expr (e->sub1, vars)
              != interpret_expr (e->sub2, vars));
    default:
      error ();
    }
  }
Abstract Syntax Tree interpreter: statement

  void interpret_stmt (const struct stmt *s, long *vars) {
    switch (s->stmt_case) {
    case stmt_sequence:
      interpret_stmt (s->sub1, vars);
      interpret_stmt (s->sub2, vars);
      break;
    case stmt_assign:
      vars [s->var_index] = interpret_expr (s->assigned_expr, vars);
      break;
    case stmt_decrement:
      vars [s->var_index] --;
      break;
    case stmt_dowhile:
      interpret_stmt (s->body, vars);
      if (interpret_expr (s->guard, vars))
        interpret_stmt (s, vars); // repeat the whole loop: a tail call
      break;
    default: error ();
    }
  }
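To try the two functions above on the down-counter, something like the following sketch should work. The struct layouts are my assumption here: only the field names come from the interpreter code, and the helper run_ast_countdown is hypothetical. The nodes live on the stack for brevity instead of being heap-allocated, and the do-while case uses a C loop rather than the recursive call, so that the example does not depend on tail-call optimization.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed node layouts: field names follow the interpreter code, the rest
   is a guess made for illustration. */
enum expr_case { expr_variable, expr_constant, expr_is_different };
struct expr {
  enum expr_case expr_case;
  int var_index;                   /* expr_variable */
  long cnst;                       /* expr_constant */
  const struct expr *sub1, *sub2;  /* expr_is_different */
};

enum stmt_case { stmt_sequence, stmt_assign, stmt_decrement, stmt_dowhile };
struct stmt {
  enum stmt_case stmt_case;
  int var_index;                   /* stmt_assign, stmt_decrement */
  const struct expr *assigned_expr;
  const struct stmt *sub1, *sub2;  /* stmt_sequence */
  const struct stmt *body;         /* stmt_dowhile */
  const struct expr *guard;
};

static long interpret_expr (const struct expr *e, const long *vars) {
  switch (e->expr_case) {
  case expr_variable: return vars [e->var_index];
  case expr_constant: return e->cnst;
  case expr_is_different:
    return interpret_expr (e->sub1, vars) != interpret_expr (e->sub2, vars);
  default: abort ();
  }
}

static void interpret_stmt (const struct stmt *s, long *vars) {
  switch (s->stmt_case) {
  case stmt_sequence:
    interpret_stmt (s->sub1, vars);
    interpret_stmt (s->sub2, vars);
    break;
  case stmt_assign:
    vars [s->var_index] = interpret_expr (s->assigned_expr, vars);
    break;
  case stmt_decrement:
    vars [s->var_index] --;
    break;
  case stmt_dowhile:
    /* A loop instead of recursion: same meaning, bounded stack use. */
    do
      interpret_stmt (s->body, vars);
    while (interpret_expr (s->guard, vars));
    break;
  default: abort ();
  }
}

/* Hypothetical driver: build "i := start; do decrement i; while i != 0;"
   and return the final value of i.  start must be positive. */
long run_ast_countdown (long start) {
  assert (start > 0);
  struct expr start_lit = { .expr_case = expr_constant, .cnst = start };
  struct expr var_i = { .expr_case = expr_variable, .var_index = 0 };
  struct expr zero = { .expr_case = expr_constant, .cnst = 0 };
  struct expr guard = { .expr_case = expr_is_different,
                        .sub1 = &var_i, .sub2 = &zero };
  struct stmt assign = { .stmt_case = stmt_assign, .var_index = 0,
                         .assigned_expr = &start_lit };
  struct stmt decrement = { .stmt_case = stmt_decrement, .var_index = 0 };
  struct stmt loop = { .stmt_case = stmt_dowhile,
                       .body = &decrement, .guard = &guard };
  struct stmt program = { .stmt_case = stmt_sequence,
                          .sub1 = &assign, .sub2 = &loop };
  long vars [1] = { -1 };
  interpret_stmt (&program, vars);
  return vars [0];
}
```

Calling run_ast_countdown (2000000000) reproduces the running example; a smaller argument is enough to see that every decrement costs several switch dispatches and pointer dereferences.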
AST interpreter performance

• pointer chasing (load latency ~ 3 cycles on L1d hit!)
• many conditionals, often multi-way (mispredict penalty ~ 15 cycles,
  per conditional branch!)
• variable lookup slow (not shown in my sample code before)
• recursion, often non-tail
A good language to interpret

What is normally called a language “Virtual Machine” is an
interpreter for a lower-level linear program:

• the program to interpret is stored as a contiguous array in
  hardware memory
• no nesting: no statements with sub-statements or expressions
  with sub-expressions
• no expressions, no variables
• assembly-like feel: registers or stacks, explicit jumps

I'll show you a linear-program interpreter written in C.
Luca Saiu http://ageinghacker.net The art of the language VM — GNU Hackers’ Meeting 2017 


The down-counter as a linear program to be interpreted

       set 2000000000, %r0
       set -1, %r1
  $L1: add %r0, %r1, %r0
       bnz %r0, $L1
       end

• VM registers are an array in hardware memory.
• The VM program is an array in hardware memory.
• Only the interpreter’s automatic C variables are in hardware
  registers.

[Figure: the C variables insn and regs point to the VM program array
(insn_set, 2000000000, 0, insn_set, -1, 1, insn_add, 0, 1, 0, insn_bnz, 0,
a pointer back to the insn_add slot, insn_end) and to the VM register
array (VM %r0 ... VM %r4).]
The simplest linear-program interpreter

What's the C type of insn_set, insn_add, insn_bnz, insn_end?

• It’s an enum insn: essentially an integer.
• There are also pointers in the VM program array from one
  element to another...
• Linear-program interpreters work best with word-sized data:
  objects as wide as a hardware register. unions are useful for
  this:

  union value
  {
    enum insn in;
    long i; // or another integer type of the right width
    union value *p;
  };

This interpretation style is called switch dispatching.

[switch dispatching: C source and demo]
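The C source shown in the demo is not reproduced here, so what follows is only a minimal sketch of what a switch-dispatching interpreter for the down-counter could look like. The enum and union follow the names used above; the operand order inside each instruction (literal or inputs first, destination last) and the helper run_switch_countdown are my assumptions.

```c
#include <assert.h>
#include <stdlib.h>

enum insn { insn_set, insn_add, insn_bnz, insn_end };

union value
{
  enum insn in;
  long i;
  union value *p;
};

/* Interpret the VM program starting at insn, using regs as VM registers. */
static void
interpret_switch (union value *insn, long *regs)
{
  for (;;)                       /* every VM instruction comes back here */
    switch (insn->in)            /* the one shared dispatch point */
      {
      case insn_set:             /* set LITERAL, %rOUT */
        regs [insn [2].i] = insn [1].i;
        insn += 3;
        break;
      case insn_add:             /* add %rIN1, %rIN2, %rOUT */
        regs [insn [3].i] = regs [insn [1].i] + regs [insn [2].i];
        insn += 4;
        break;
      case insn_bnz:             /* bnz %rIN, TARGET */
        if (regs [insn [1].i] != 0)
          insn = insn [2].p;
        else
          insn += 3;
        break;
      case insn_end:
        return;
      default:
        abort ();
      }
}

/* Hypothetical driver: assemble the down-counter by hand, run it and
   return the final content of VM register %r0.  start must be positive. */
long
run_switch_countdown (long start)
{
  assert (start > 0);
  long regs [5] = { 0 };
  union value program [14];
  program [0].in = insn_set;  program [1].i = start;  program [2].i = 0;
  program [3].in = insn_set;  program [4].i = -1;     program [5].i = 1;
  program [6].in = insn_add;  /* this slot is $L1 */
  program [7].i = 0;  program [8].i = 1;  program [9].i = 0;
  program [10].in = insn_bnz; program [11].i = 0;
  program [12].p = &program [6];
  program [13].in = insn_end;
  interpret_switch (program, regs);
  return regs [0];
}
```

The 14 slots match the program array in the previous figure: one slot per opcode, one per argument, and one pointer from the bnz argument back to the add.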





Problems with switch-dispatching

Performance of a switch-dispatching interpreter:

• switch is somewhat inefficient (range checking)
• The CPU branch target predictor can't work well: one jumping
  instruction with many possible targets, complex repetition
  patterns.
• Every VM instruction ends with another jump to the one
  shared switch.
Computed goto

GCC introduced the C extension called computed goto or
labels-as-values:

• The expression &&label, of type void *, evaluates to the
  address of the hardware machine instruction where the labeled
  code begins; you can store the address and jump to it later.
• The statement goto *expr jumps to the result of the
  evaluation of expr.

We can use pointers to native code instead of enums in
the VM program, at the beginning of every VM instruction.

This is called direct-threaded code (nothing to do with
multi-threading).
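As a quick illustration of the two constructs just described, here is a tiny sketch unrelated to any interpreter. The function name and the two-entry table are made up for the example; it needs GCC or a compatible compiler such as Clang, since this is not ISO C.

```c
#include <assert.h>

/* Count down from n to 0 using only computed gotos; return the number of
   decrements performed.  n must not be negative. */
long
count_with_computed_goto (long n)
{
  assert (n >= 0);
  /* &&label is a void * value: it can be stored in an array like any
     other pointer. */
  void *targets [] = { &&again, &&done };
  long steps = 0;

  goto *targets [n == 0];     /* index 0: keep going; index 1: stop */
 again:
  n --;
  steps ++;
  goto *targets [n == 0];     /* the destination is computed at run time */
 done:
  return steps;
}
```

There is no switch and no conditional branch in the source: every transfer of control goes through an address loaded from memory, which is the mechanism direct threading builds on.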
The down-counter program for a direct-threaded VM

[Figure: the same VM program array as before (arguments 2000000000, 0,
-1, 1, ...) and the VM register array VM %r0 ... VM %r4, pointed to by
insn and regs. The slots that held enum values now each point to a block
of compiled hardware machine code: code for set, code for add, code for
bnz, code for end.]

Instead of an enum identifier each VM instruction in the VM
program begins with a pointer to its native code.
Direct-threaded interpreter

In direct threading:
• interpreting the VM instruction pointed to by a C pointer p is
  trivial: goto *p;
• there's no switch
• no infinite loop or jump to a shared conditional: each VM
  instruction “falls thru” to the next:
    • move insn forward
    • load the next VM instruction code pointer from it
    • goto * to the code pointer
• Many different jumping hardware instructions: less bad for the
  hardware branch target predictor
• (also, still as compact in memory as switch-dispatching:
  useful for small embedded systems, but not particularly for
  GNU)

[C source and demo]
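The demo source is not included in these notes; the following is a minimal sketch, under my own assumptions, of a direct-threaded interpreter for the down-counter. The union name threaded_value, its label field and the driver run_threaded_countdown are hypothetical (the field names label, i and p match the snippets used in this talk), and the code requires GCC's labels-as-values extension.

```c
#include <assert.h>

/* One word-sized program slot: a code address, an integer or a pointer to
   another slot.  Assumed layout, for illustration only. */
union threaded_value
{
  void *label;
  long i;
  union threaded_value *p;
};

/* Assemble and run the down-counter; return the final content of VM
   register %r0.  start must be positive. */
long
run_threaded_countdown (long start)
{
  assert (start > 0);
  long regs [5] = { 0 };
  union threaded_value program [14];
  union threaded_value *insn = program;

  /* Same layout as the switch-dispatching program, but each instruction
     begins with the address of its native code instead of an enum. */
  program [0].label = &&label_set;  program [1].i = start;
  program [2].i = 0;
  program [3].label = &&label_set;  program [4].i = -1;
  program [5].i = 1;
  program [6].label = &&label_add;  /* this slot is $L1 */
  program [7].i = 0;  program [8].i = 1;  program [9].i = 0;
  program [10].label = &&label_bnz; program [11].i = 0;
  program [12].p = &program [6];
  program [13].label = &&label_end;

  goto *insn->label;                /* start the first VM instruction */

 label_set:                         /* set LITERAL, %rOUT */
  regs [insn [2].i] = insn [1].i;
  insn += 3;
  goto *insn->label;                /* each instruction dispatches on its own */
 label_add:                         /* add %rIN1, %rIN2, %rOUT */
  regs [insn [3].i] = regs [insn [1].i] + regs [insn [2].i];
  insn += 4;
  goto *insn->label;
 label_bnz:                         /* bnz %rIN, TARGET */
  if (regs [insn [1].i] != 0)
    insn = insn [2].p;
  else
    insn += 3;
  goto *insn->label;
 label_end:
  return regs [0];
}
```

Note how the dispatch code is replicated at the end of every VM instruction: that replication is what gives the branch target predictor one indirect jump per VM instruction instead of a single shared one.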
Direct-threaded fallthru (nop): diagram

The zero-argument VM instruction nop does nothing and just falls
thru to the next instruction.
The jump destination address is pointed from memory (red arrow).
The green arrow is the pointer insn, already in a hardware register.

[Figure: two adjacent program slots, surrounded by other slots marked
"something". The first points to the compiled hardware machine code for
nop, the second to the compiled hardware machine code for whatever comes
next. insn points to the nop slot; regs points to the VM register array
VM %r0 ... VM %r4.]

There is nothing between the code pointer for nop and the code
pointer for the next VM instruction because nop has no arguments.
Direct-threaded fallthru (nop): code

Here's the source for the VM instruction nop in the
direct-threading interpreter:

  label_nop:
    insn ++; // No args to skip, just the code pointer
    goto * insn->label;

compiled (x86_64)

  movq 8(%rax), %rdx   # insn is in rax; load (insn + 1)->label
  addq $8, %rax        # advance insn to the next instruction
  jmpq *%rdx           # jump to the address we loaded before

GCC has put insn in the hardware register %rax. The load (movq
on x86_64) follows the red arrow, from rax + 8. The hardware
register %rdx is a temporary, holding the address where to jump.
Direct-threaded unconditional branch (b): diagram

The b VM instruction takes a label as its parameter: the next VM
program slot after b's code pointer points to the beginning of the target
instruction (another slot in the program containing a code pointer).

[Figure: the program slot for b points to the compiled hardware machine
code for b; the slot right after it points to another program slot, which
in turn points to the compiled hardware machine code for the target
instruction. The VM register array VM %r0 ... VM %r4 is shown as before.]
Direct-threaded unconditional branch (b): code

The (one-argument) VM instruction b in the direct-threading
interpreter:

  label_b:
    insn = insn[1].p;
    goto * insn->label;

compiled (x86_64)

  movq 8(%rax), %rax   # load jump destination from *(insn + 1)
  jmpq *(%rax)         # jump indirect via memory: another load

The first instruction loads the next insn, still pointing within the
program array. The jump-via-memory instruction chases a pointer
from it and obtains a pointer into a “blue” box: the hardware
instruction where the target VM instruction begins.
Direct-threaded conditional branch (bnz)

The two-argument VM instruction bnz in the direct-threading
interpreter:

  label_bnz:
    if (regs[insn[1].i] != 0)
      insn = insn[2].p;
    else
      insn += 3;
    goto * insn->label;

compiled (x86_64, simplified)

     movq 8(%rax), %rdx
     cmpq $0, -256(%rbp,%rdx,8)
     je L
     movq 16(%rax), %rax   # Like b
     jmpq *(%rax)
  L: addq $24, %rax        # Fallthru
     jmpq *(%rax)

Check the condition; if false skip past (je) unconditional branch
code, and into fallthru dispatch code.

Lots of hardware branches, depending on memory and on each
other.
Direct threading dispatch performance

The real question is whether we can do better, and where the
bottleneck is.

Is branching/fallthru the only source of inefficiency?

[Demo: quick timing against switch-dispatching]
(Direct-threaded) VM add: “fundamental”/RISC operations

Let's look at how the VM instruction add %r3, %r0, %r1 is represented in the
VM program and what it needs to do in terms of hardware “operations”:

[Figure: insn points to the program slot holding a pointer to the compiled
hardware machine code for add; the next three slots hold the arguments 3,
0, 1; the slot after them points to the compiled hardware machine code for
the following instruction. regs points to the VM register array
VM %r0 ... VM %r3.]

• read VM register indices (load from insn[k] obtaining 3, 0, 1)
• read VM input register contents from the VM register array using input
  indices (load VM register elements %r3, %r0 using indices 3, 0)
• do the actual sum
• write result into VM register array (store into VM %r1 using index 1)
• fallthru: increment-load-jump, as always
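The steps listed above can be spelled out one per line in C. This is only a sketch to make the count of memory accesses visible: the union slot type and the function vm_add_steps are hypothetical names, and the final load-and-jump of the fallthru is left to the caller so that the function stays plain ISO C.

```c
#include <assert.h>
#include <stddef.h>

/* One word-sized program slot; the slot at index 0 of an instruction would
   hold the code pointer, which this sketch never reads. */
union slot
{
  long i;
  union slot *p;
};

/* Execute one "add %rIN1, %rIN2, %rOUT" whose code-pointer slot is at
   insn, and return the address of the next VM instruction. */
union slot *
vm_add_steps (union slot *insn, long *regs)
{
  assert (insn != NULL && regs != NULL);

  /* 1. Read the VM register indices: three loads from the program. */
  long in1 = insn [1].i;
  long in2 = insn [2].i;
  long out = insn [3].i;

  /* 2. Read the input register contents: two loads from the VM register
        array, each at an address depending on a previous load. */
  long a = regs [in1];
  long b = regs [in2];

  /* 3. The actual sum: the only "useful" operation. */
  long sum = a + b;

  /* 4. Write the result: one store into the VM register array. */
  regs [out] = sum;

  /* 5. Fallthru: the increment part of increment-load-jump.  A real
        direct-threaded interpreter would now load the code pointer from
        the returned slot and jump to it. */
  return insn + 4;
}
```

Counting the lines: five loads and one store surround a single addition, before the dispatch to the next VM instruction even starts.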
The VM instruction add (here direct-threaded), compiled

Is our three-operand add simple and fast, at least on a CISC?

  label_add:
    regs [insn[3].i]
      = (  regs[insn[1].i]
         + regs[insn[2].i]);
    insn += 4;
    goto * insn->label;

compiled (x86_64, simplified)

  movq 8(%rax), %rsi
  movq 16(%rax), %rdx
  addq $32, %rax
  movq -8(%rax), %rcx
  movq -256(%rbp,%rdx,8), %rdx
  addq -256(%rbp,%rsi,8), %rdx   # +
  movq %rdx, -256(%rbp,%rcx,8)
  movq (%rax), %rdx
  jmpq *%rdx

• the actual addition costs only one hardware instruction (the
  second addq, marked # + above [which also includes one memory access]).
• Fallthru to the next VM instruction: three hardware
  instructions (increment-load-jump).

The VM instruction add (here direct-threaded), compiled

Is our three-operand add simple and fast, at least on a CISC?

label_add:
  regs[insn[3].i]
    = (regs[insn[1].i]
       + regs[insn[2].i]);
  insn += 4;
  goto * insn->label;

compiled (x86_64, simplified)

  movq  8(%rax), %rsi
  movq  16(%rax), %rdx
  addq  $32, %rax
  movq  -8(%rax), %rcx
  movq  -256(%rbp,%rdx,8), %rdx
  addq  -256(%rbp,%rsi,8), %rdx   # +
  movq  %rdx, -256(%rbp,%rcx,8)
  movq  (%rax), %rdx
  jmpq  *%rdx

• The actual addition costs only one hardware instruction (the
  second addq, marked “# +” [which also includes one memory access]).
• Fallthru to the next VM instruction: three hardware
  instructions (increment-load-jump).
• The other five hardware instructions only serve to access VM
  registers (and on RISCs it’s even worse).

(Direct-threaded) VM add: register indices and shifts

In the C code for VM instructions we access VM register contents
with expressions such as regs[idx], where idx is usually
insn[k].i for some constant k.

Reading insn[k].i into idx costs one load instruction (register
plus a known constant offset). Loading regs[idx] is more
delicate: the address to load from is

  regs + idx · w

where w is the word size in bytes (4 on 32-bit machines, 8 on
64-bit machines). The multiplication requires a separate shift
instruction on most RISC machines [plus possibly yet another instruction
for summing regs and (idx · w): needed on RISC-V, MIPS, Alpha].

Shifting at run time is silly: instead of keeping VM register indices
in the VM program we can keep VM register offsets from regs, or
in other words we can keep pre-shifted register indices.
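
A minimal sketch of the difference between indices and pre-shifted offsets. The regs array follows the slides; the helper names read_by_index, read_by_offset and preshift are made up for this illustration and are not part of the presented software.

```c
#include <stddef.h>

/* VM register array; the name follows the slides, the contents are
   arbitrary demo values. */
long regs[4] = { 10, 20, 30, 40 };

/* The VM program stores a register index: the hardware has to compute
   regs + idx * sizeof (long) at run time (a shift on most RISCs). */
long
read_by_index (size_t idx)
{
  return regs[idx];
}

/* The VM program stores a pre-shifted offset in bytes: only the sum
   with the regs base is left to do at run time. */
long
read_by_offset (size_t offset)
{
  return *(long *) ((char *) regs + offset);
}

/* Hypothetical helper, run once per register argument when the VM
   program is built rather than every time the instruction executes. */
size_t
preshift (size_t idx)
{
  return idx * sizeof (long);
}
```

Calling read_by_offset (preshift (3)) returns the same element as read_by_index (3); the multiplication has simply moved from execution time to program-construction time.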

(Direct-threaded) VM add: operation dependency graph

“a → b” means that a uses the result of b, so b is executed before
a. Thick arrows mean high latencies (≈ 3τ).

[Register index shifts shown, offset sums to regs base not shown]

[Figure: dependency graph, written here as chains where “x ← y” means
that y uses the result of x:
  load idx 0 ← shift idx 0 ← load VM reg 0 ← add
  load idx 1 ← shift idx 1 ← load VM reg 1 ← add
  add ← store VM reg 2
  load idx 2 ← shift idx 2 ← store VM reg 2
  load target ← jump, plus update insn]

Two long dependency chains, each including two loads:
load ← shift ← load ← add ← store. ≈ 6τ latency just from the loads,
with ideal Instruction-Level Parallelism! In practice it will be worse.

(Direct-threaded) VM b: operation dependency graph

[Figure: a single chain, load instr ← load code pointer ← jump]

Longest (and only) dependency chain: load ← load ← jump. A VM
unconditional branch has latency similar to a VM add; a VM b can
easily be faster than a VM add if the hardware branch target
predictor does its job.

VMs and hardware machines can have very different performance
profiles.

[I've understood, too late to make the change before the GHM, that this is
optimizable. Can you see how? Hint: b can have two arguments instead of
one, at least in the memory representation of the program.]

What if we used a stack

Stack-oriented VM instructions replace the top few elements of a
stack with the result of an operation. For example stack_add
(zero arguments) could pop two elements (say, 5 and 6) from the
stack and push their sum (11). This idea is about using stacks
instead of VM registers, not just call stacks.

The authors of [Shi et al., 2005], in other works as well, argue from
experimental data that direct-threaded register VMs are faster than
direct-threaded stack VMs (same model I'm presenting here, stack
code machine-translated to VM-register code with optimizations).

Unfortunately it’s difficult to replicate their measurements.

I wonder if their results still hold today, with our proportionally
slower L1d caches and better branch predictors. [Still, stack code
takes more instructions to do the same work, today as in 2005]

This is why I have doubts:

Naive stack implementation

Suppose the VM has a stack in a hardware memory array, with a
top-of-stack pointer in a hardware register. This is a zero-argument
stack_add VM instruction:

label_stack_add:
  top[-1] = top[-1] + top[0];
  top--;
  /* Fallthru code omitted, same as always. */

[Figure: Before: top points to 5, with 6 in the slot under it. After: top
points to that lower slot, which now holds 11; the 5 is left behind, unused.]

Two (independent) loads, one store. This looks better than our
VM-register add: constant offsets from top, no index/offset loads.
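
A minimal, testable sketch of the naive stack, assuming a fixed-size array and an unused sentinel slot at the bottom so that top always points into the array. The names stack and stack_push are illustrative assumptions; top and stack_add follow the slide, and the threaded-code fallthru is left out.

```c
/* VM stack kept entirely in memory.  stack[0] is an unused bottom
   sentinel (an assumption of this sketch), so that top can always
   point to a valid element. */
long stack[16];
long *top = stack;

/* Hypothetical helper: push one element. */
void
stack_push (long x)
{
  *++top = x;
}

/* Same body as the stack_add VM instruction on the slide, without the
   fallthru code: two independent loads and one store, all at constant
   offsets from top. */
void
stack_add (void)
{
  top[-1] = top[-1] + top[0];
  top--;
}
```

Pushing 6 and then 5 and calling stack_add leaves 11 in the slot top points to, matching the figure.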
Luca Saiu http: //ageinghacker.net The art of the language VM — GNU Hackers’ Meeting 2017 


Top-Of-Stack (TOS) optimization

We can do even better: keep the VM top stack element in a
hardware register (the rest of the stack still in a hardware memory
array), and an under-top pointer in a second hardware register.

label_stack_add:
  tos += undertop[0];
  undertop--;
  /* Fallthru code omitted, same as always. */

[Figure: Before: tos = 5, undertop points to 6. After: tos = 11,
undertop points to the slot under the 6, which is left behind, unused.]

Only one load. Other VM instructions working only on the TOS
(for example stack_increment) require zero loads.
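
A minimal sketch of the TOS-optimized variant. Here tos is a global only so that the sketch can be tested from outside; in an actual interpreter it would be a local variable of the interpreter function, which is what lets the compiler keep it in a hardware register. The names tos_push, tos_stack_add and tos_stack_increment are illustrative assumptions.

```c
/* Everything except the top element lives in memory.  stack[0] is an
   unused bottom sentinel (an assumption of this sketch). */
long stack[16];
long *undertop = stack;

/* The top element.  Global here for testability only. */
long tos;

/* Hypothetical helper: spill the old top to memory, then replace it. */
void
tos_push (long x)
{
  *++undertop = tos;
  tos = x;
}

/* Same body as the TOS-optimized stack_add on the slide, without the
   fallthru code: a single load, no store. */
void
tos_stack_add (void)
{
  tos += undertop[0];
  undertop--;
}

/* An instruction working only on the TOS: no memory access at all. */
void
tos_stack_increment (void)
{
  tos++;
}
```

The price is paid in tos_push, which now has to spill the old top element to memory before replacing it.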

(Direct-threaded) TOS-optimized stack_add: operations

This includes the fallthru operations (update insn, load target,
jump).

[Figure: dependency graph over the operations load target, jump,
update insn, load under-top, update under-top ptr. and add; jump uses
load target, and add uses load under-top.]

Very “flat”-looking graph with short dependency chains (max
length 1). Not many operations.

You've seen what every simple VM does...

Nothing of what you saw up to here is new except for the removal
of register index shifts, a minor optimization.

I want to make my VMs faster. In order of priority I need to:
• optimize VM register (and immediate argument) access [new]
• optimize fallthru [I learned the idea from [Ertl and Gregg, 2004], which
  builds upon previous work]
• remove insn and the VM program in memory [conceptually easy]
• optimize VM branches [my technique is new]

I want zero overhead in the common cases.

Optimizing VM register access

VM registers should not be in hardware memory.
I want them in hardware registers (as long as they fit).

The problem: every time I do anything with

  regs[e]

and the value of e isn’t known at compile time I lose. GCC can’t
put any regs element in a specific hardware register as long as there is
even one regs[e] expression with unknown e, reading or
writing.

The solution: never use regs[e] with a non-constant e; or even
split regs into scalar variables reg_0, reg_1, reg_2, ... and never
take the address of those variables: writing “& reg_i” is forbidden
for every i.

Let's look at a VM instruction such as add

[Here with register indices rather than offsets, just for simplicity: same point]

label_add:
  regs[insn[3].i] = regs[insn[1].i] + regs[insn[2].i];
  insn += 4;
  goto * insn->label;

Here regs is (always) indexed with insn[k].i, an index coming
from the interpreted program!

And this pattern is very common across VM instructions.

No hope with this VM instruction code.

VM instruction specialization

A radical solution: forbid register indices/offsets as VM instruction
arguments.

Remove the VM instruction add taking three index/offset
arguments from the interpreter. Instead there will be many
specialized VM instructions:

  add/%r0/%r0/%r0,
  add/%r0/%r0/%r1,
  add/%r0/%r1/%r0,
  add/%r0/%r1/%r1, ...
  add/%r1/%r1/%r1,
  add/%r0/%r0/%r2, ... Every possible combination.

Specialized instructions have no register-index/offset arguments;
the specializations of our example’s add all have zero arguments.
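
A minimal hand-written sketch of what two such specializations could look like in a direct-threaded interpreter. It relies on the GNU C “labels as values” extension (&&label and goto *), as the earlier slides do. The function name run_specialized, the label names and the tiny fixed VM program are assumptions made for this illustration; the scalar variables reg_0, reg_1, reg_2 follow the earlier slide.

```c
/* Requires GNU C (GCC or Clang) for &&label and "goto *". */
long
run_specialized (long r0_init, long r1_init)
{
  /* VM registers as scalar locals whose address is never taken, so the
     compiler is free to keep each of them in a hardware register. */
  long reg_0 = r0_init, reg_1 = r1_init, reg_2 = 0;

  /* The VM program: one label per instruction and nothing else, since
     specialized instructions have no register arguments left. */
  void *program[] = { &&add_r0_r1_r2, &&add_r2_r2_r2, &&halt };
  void **insn = program;

  goto **insn;

 add_r0_r1_r2:                  /* add/%r0/%r1/%r2 */
  reg_2 = reg_0 + reg_1;
  insn++;                       /* Fallthru: only one slot to skip. */
  goto **insn;

 add_r2_r2_r2:                  /* add/%r2/%r2/%r2 */
  reg_2 = reg_2 + reg_2;
  insn++;
  goto **insn;

 halt:
  return reg_2;
}
```

No regs[e] expression with unknown e appears anywhere: every register access names a specific scalar variable, at the cost of one code block per combination.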

Bear with me

Yes, I know that you have objections at this point.

Please give me one minute. I will address them.

Where am I going?

Specialization is not manageable in human-written code:
• very long and redundant code
• fragile with respect to trivial details [how many program slots
  to skip for fallthru? The number depends on how many
  arguments are VM registers]

The solution is machine-generating C code.

The project takes shape

The new software I'm presenting is a code generator, automatically
emitting C code for a VM from a human-written specification. Like
Bison, and even more like Vmgen [Ertl et al., 2002], [Ertl, 2008].

• user-provided C code snippets for each unspecialized
  instruction
• convenient automatically-defined CPP macros to refer to
  (pre-specialization) arguments, and more
• fallthru code implicit for every VM instruction, automatically
  added by the generator

A VM instruction specification from the “Uninspired” VM (edited):

instruction add (?R, ?R, !R)
  code
    UNINSPIRED_ARGN2 = UNINSPIRED_ARGN0 + UNINSPIRED_ARGN1;
  end
end
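
A rough sketch of how one user-written body could be stamped out into several specializations by redefining the argument macros. This is only a guess at the general mechanism, not the generator's actual output: the UNINSPIRED_ARGN0..2 names come from the specification above, while UNINSPIRED_ADD_CODE, run_generated and the straight-line layout (no dispatching) are assumptions made to keep the example short.

```c
/* The user-written body, kept once.  CPP expands the argument macros
   with whatever definitions are in effect where the body is used. */
#define UNINSPIRED_ADD_CODE \
  UNINSPIRED_ARGN2 = UNINSPIRED_ARGN0 + UNINSPIRED_ARGN1

long
run_generated (long a, long b)
{
  /* VM registers as scalar locals, never addressed. */
  long reg_0 = a, reg_1 = b, reg_2 = 0;

  /* Specialization add/%r0/%r1/%r2. */
#define UNINSPIRED_ARGN0 reg_0
#define UNINSPIRED_ARGN1 reg_1
#define UNINSPIRED_ARGN2 reg_2
  UNINSPIRED_ADD_CODE;
#undef UNINSPIRED_ARGN0
#undef UNINSPIRED_ARGN1
#undef UNINSPIRED_ARGN2

  /* Specialization add/%r2/%r0/%r0: same body, different registers. */
#define UNINSPIRED_ARGN0 reg_2
#define UNINSPIRED_ARGN1 reg_0
#define UNINSPIRED_ARGN2 reg_0
  UNINSPIRED_ADD_CODE;
#undef UNINSPIRED_ARGN0
#undef UNINSPIRED_ARGN1
#undef UNINSPIRED_ARGN2

  return reg_0;
}
```

The point of the design is that the person writing the specification only ever sees the unspecialized arguments; which hardware location each one maps to is decided per specialization by the generated definitions.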

Generated C code:

Making VMs general:
• VM registers, or stacks (TOS-optimized or not), both,
  anything else implemented by the user
• user-specified data types (register classes: for example
  integer/pointer, floating point, vector, ...)
• several possible dispatching models
    • switch-dispatching, direct threading, other models I'll show
      later;
    • different performance profiles, identical behavior!
    • lots of #ifdefs in the generated C code; choose dispatching
      model by compiling with -DDIRECT_THREADING, ...
• include custom C code from the user
• compatible with multi-threading and garbage collection,
  including exact pointer-finding [not just conservative as in Hans
  Boehm’s GC]
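
A small sketch of the “identical behavior, different dispatching” idea: one toy interpreter whose dispatching model is chosen at compile time. The -DDIRECT_THREADING flag name is the one mentioned above; the opcode set, run, demo_program and MAX_LEN are assumptions made for this illustration, and the direct-threaded branch needs GNU C.

```c
#include <assert.h>
#include <stddef.h>

enum opcode { OP_INCR, OP_DOUBLE, OP_HALT };

/* Hypothetical demo program: increment, double, increment, halt. */
const enum opcode demo_program[] = { OP_INCR, OP_DOUBLE, OP_INCR, OP_HALT };

#define MAX_LEN 16

/* Run a program of len opcodes, ending with OP_HALT, on one
   accumulator. */
long
run (const enum opcode *code, size_t len, long acc)
{
  assert (len > 0 && len <= MAX_LEN);
#ifdef DIRECT_THREADING
  /* Translate opcodes into label addresses once, then jump from one
     instruction to the next with no central dispatching loop. */
  static void *const labels[] = { &&do_incr, &&do_double, &&do_halt };
  void *threaded[MAX_LEN];
  for (size_t i = 0; i < len; i++)
    threaded[i] = labels[code[i]];
  void **insn = threaded;
  goto **insn;
 do_incr:
  acc++;
  insn++;
  goto **insn;
 do_double:
  acc *= 2;
  insn++;
  goto **insn;
 do_halt:
  return acc;
#else
  /* Portable switch dispatching: standard C only. */
  size_t pc = 0;
  for (;;)
    switch (code[pc++])
      {
      case OP_INCR:   acc++;    break;
      case OP_DOUBLE: acc *= 2; break;
      case OP_HALT:   return acc;
      }
#endif
}
```

Compiled with or without -DDIRECT_THREADING, run (demo_program, 4, 3) computes ((3 + 1) * 2) + 1; only the performance profile is meant to change.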
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Generated C code: 





Making VMs portable with respect to different CPU architectures 
(also important for political reasons: free hardware as a prerequisite 
for privacy) 
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Generated C code: 





Making VMs portable with respect to different CPU architectures 
(also important for political reasons: free hardware as a prerequisite 
for privacy) 
ə Using C with as little assembly as possible, and not in user 
code (the assembly part is VM-independent, and already 
provided) 
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Generated C code: 





Making VMs portable with respect to different CPU architectures 
(also important for political reasons: free hardware as a prerequisite 
for privacy) 
ə Using C with as little assembly as possible, and not in user 
code (the assembly part is VM-independent, and already 
provided) 


ə even that little assembly is optional, only for better 
performance 
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Making VMs portable with respect to different CPU architectures 
(also important for political reasons: free hardware as a prerequisite 
for privacy) 
ə Using C with as little assembly as possible, and not in user 
code (the assembly part is VM-independent, and already 
provided) 


ə even that little assembly is optional, only for better 
performance 
ə VMs behave identically, with or without assembly support 





Luca Saiu http: //ageinghacker .net The art of the language VM — GNU Hackers’ Meeting 2017 


Generated C code: 





Making VMs portable with respect to different CPU architectures
(also important for political reasons: free hardware as a prerequisite
for privacy)

• Using C with as little assembly as possible, and not in user
  code (the assembly part is VM-independent, and already
  provided)
• even that little assembly is optional, only for better
  performance
• VMs behave identically, with or without assembly support
• direct threading with specialization is as portable as GCC
• switch-dispatching even more portable (no goto *) (not yet
  implemented, but trivial)
• compiled VMs work comfortably even on “small” machines
  (32MB RAM is plenty; probably 8 or even 4MB is enough)
• (Compiling VMs is heavier, as you have guessed already)
Generated-code goodies

Along with the generated code you get:

• C API for dynamically generating and executing VM programs
  from your application
• driver with command-line options (main with convenient GNU
  command-line support for debugging and benchmarking)
• frontend: VM program parser and printer
• cross-compilation support
• disassembly to native or (via qemu-user) cross-compiled code
• test suite (even cross-, via qemu-user)

If you want to generate a direct-threaded VM of the kind many
projects already use (Emacs, Guile, ...), it’s trivial. But getting a
much more efficient VM is not any more difficult.
VM specialized instructions: combinatorial explosion

If we have n registers and m instructions (for example) all taking 3
register indices as arguments, the specialized instructions are m · n³.

Yes, there are practical limits on how many VM registers of this
kind you can have.

There are ways to reduce this growth and some optimizations I
haven't implemented yet, but compiling a machine-generated VM is
heavy. GCC can use GBs of RAM and take minutes to run when
VM registers are many.
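To make the growth concrete, here is a small sketch of the arithmetic. The function name and the sample figures (m = 50 instructions) are made up for illustration; only the m · n³ formula comes from the slide.

```c
/* Number of specialized instructions when each of m VM instructions takes
   `arity` register arguments, each ranging over n registers: m * n^arity. */
unsigned long long
specialization_count (unsigned long long m, unsigned long long n,
                      unsigned arity)
{
  unsigned long long result = m;
  for (unsigned i = 0; i < arity; i ++)
    result *= n;
  return result;
}
```

With a hypothetical m = 50 and three register arguments, 8 registers give 25600 specialized instructions and 16 registers give 204800: doubling the registers multiplies the count by eight.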
Limiting combinatorial explosion

Some specialized instructions are useless or can be normalized:

• For example, addition is commutative: add/%r0/%r1/%r2 and
  add/%r1/%r0/%r2 do the same work, and we can keep only
  one. This halves the number of (commutative) specialized
  instructions.

• We can also rewrite every specialized instruction such as
      add/%ri/%rj/%rk
  into a two-specialized-instruction sequence
      copy/%rj/%rk
      add/%ri/%rk/%rk
  whenever j ≠ k. [This is correct because add writes its third
  argument, but doesn’t read it; if k = i, swap the two source
  operands first, which commutativity allows.] This rewrite can cut the
  number of specialized instructions from m · n³ to m · n².

Every specialized instruction which is not a rewrite target “doesn't
exist”.
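The add rewrite can be sketched in a few lines of C. This is an illustration under stated assumptions, not the generator's code: struct insn, normalize_add, execute and rewrite_is_faithful are hypothetical names, and the sketch handles the k = i case by swapping the two source operands, relying on commutativity.

```c
#include <stdbool.h>

#define MAX_REGISTER_NO 8

enum op { OP_ADD, OP_COPY };

/* add a, b, c  means  %rc = %ra + %rb;
   copy a, c    means  %rc = %ra  (b is unused). */
struct insn { enum op op; int a, b, c; };

/* Rewrite add/%ri/%rj/%rk so that the only add form left is
   add/%rx/%ry/%ry.  Return how many instructions were written to out. */
int
normalize_add (int i, int j, int k, struct insn out[2])
{
  if (j == k)                                   /* already normal */
    {
      out[0] = (struct insn) { OP_ADD, i, j, k };
      return 1;
    }
  if (i == k)                                   /* swap the sources */
    {
      out[0] = (struct insn) { OP_ADD, j, i, k };
      return 1;
    }
  out[0] = (struct insn) { OP_COPY, j, 0, k };  /* %rk = %rj */
  out[1] = (struct insn) { OP_ADD, i, k, k };   /* %rk = %ri + %rk */
  return 2;
}

/* Tiny reference executor, used only to check the rewrite. */
void
execute (const struct insn *code, int length, long regs[])
{
  for (int n = 0; n < length; n ++)
    if (code[n].op == OP_ADD)
      regs[code[n].c] = regs[code[n].a] + regs[code[n].b];
    else
      regs[code[n].c] = regs[code[n].a];
}

/* Compare the rewritten code against a plain add for every (i, j, k)
   over register_no registers (at most MAX_REGISTER_NO). */
bool
rewrite_is_faithful (int register_no)
{
  for (int i = 0; i < register_no; i ++)
    for (int j = 0; j < register_no; j ++)
      for (int k = 0; k < register_no; k ++)
        {
          long expected[MAX_REGISTER_NO], actual[MAX_REGISTER_NO];
          for (int r = 0; r < register_no; r ++)
            expected[r] = actual[r] = 1L << r;  /* distinct values */
          expected[k] = expected[i] + expected[j];
          struct insn code[2];
          int length = normalize_add (i, j, k, code);
          execute (code, length, actual);
          for (int r = 0; r < register_no; r ++)
            if (expected[r] != actual[r])
              return false;
        }
  return true;
}
```

After normalization every surviving add has its second and third arguments equal, so only two register indices remain free per instruction.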
Limiting combinatorial explosion: rewriting

What I've outlined can be expressed as a rewriting system.

Which rewrites are valid depends on the properties of each specific
instruction: such properties must be declared by the user in her VM
specification, and cannot in general be inferred.

I've not fully implemented rewriting yet, even though the parser
recognizes a preliminary syntax. I want a rule-based system which
is expressive enough to limit growth, and also to perform a few
optimizations in the VM program [for this reason I will implement
rewriting on unspecialized VM instructions].

Some manual tests have convinced me that with fewer useless VM
instructions GCC will do a better job of allocating registers for
those which remain. Implementing rewriting is high-priority.

[GCC register allocation gets worse with many VM registers, on most but not
all architectures. Is there a GCC expert I can talk to here?]
Combinatorial explosion and stack-based instructions

Do we have the same combinatorial explosion problem with
stack-based instructions?

• No. The unspecialized VM instruction add_stack has zero
  arguments, and only one specialization.
• More generally, implied operands limit combinatorial explosion,
  even with registers. Example: special-purpose registers: mul
  and div could always write to the same destination register...
• Rewrite rules are an easy and powerful way of optimizing stack
  code.
  Example:
      stack_push 10
      stack_plus
    →
      stack_plusi 10

We'll see how effective this is after I implement rewriting.
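The stack rewrite in the example can be sketched as a peephole pass. The representation (struct stack_insn, rewrite_push_plus) is hypothetical, and the sketch assumes that no branch targets the stack_plus being fused, which a real rewriter would have to check.

```c
enum stack_op { STACK_PUSH, STACK_PLUS, STACK_PLUSI };

struct stack_insn { enum stack_op op; long literal; };

/* Rewrite, in place, every "stack_push N; stack_plus" pair into the single
   instruction "stack_plusi N".  Return the new program length. */
int
rewrite_push_plus (struct stack_insn *code, int length)
{
  int out = 0;
  for (int in = 0; in < length; in ++)
    if (code[in].op == STACK_PUSH
        && in + 1 < length
        && code[in + 1].op == STACK_PLUS)
      {
        long literal = code[in].literal;
        code[out].op = STACK_PLUSI;
        code[out].literal = literal;
        out ++;
        in ++;                    /* skip the stack_plus just fused */
      }
    else
      code[out ++] = code[in];
  return out;
}
```

A three-instruction program "stack_push 5; stack_push 10; stack_plus" shrinks to "stack_push 5; stack_plusi 10".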
Is VM specialization worth the trouble?

Remove every access to regs with a non-constant index from the
interpreter. Then:

(Macro-expanded) GNU C

    label_add_r0_r1_r1:
      regs[1] = regs[0] + regs[1];
      insn ++; // skip code ptr. only
      goto * insn->label;

Now regs indices are constants
(different in every specialization):

compiled (x86_64)

    addq $8, %rax
    addq %rbx, %rcx
    jmpq *(%rax)   # Jump via memory

Much better than the unspecialized version!

Here GCC has kept the VM register %r0 in the hardware register
%rbx and the VM register %r1 in the hardware register %rcx.

[When there aren't enough hardware machine registers GCC will
allocate some VM registers on the C stack, at a known offset from
the C stack/frame pointer: still faster than without specialization.]
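For comparison, here is a runnable GNU C sketch with an unspecialized add next to its specialization on %r0, %r1, %r1. The function run_add_demo, the union slot layout and the exit label are assumptions made for the sake of a self-contained example; only the body of label_add_r0_r1_r1 mirrors the slide.

```c
/* A VM program slot: either a code pointer or a residual argument. */
union slot { void *label; long i; };

/* Compute r0 + r1 with either version of add; return the final %r1. */
long
run_add_demo (long r0, long r1, int specialized)
{
  long regs[2] = { r0, r1 };

  /* Unspecialized: the three register indices are residual arguments. */
  union slot unspecialized_program[] =
    { { .label = &&label_add }, { .i = 0 }, { .i = 1 }, { .i = 1 },
      { .label = &&label_exit } };
  /* Specialized: the code pointer alone identifies the registers. */
  union slot specialized_program[] =
    { { .label = &&label_add_r0_r1_r1 }, { .label = &&label_exit } };

  const union slot *insn
    = specialized ? specialized_program : unspecialized_program;
  goto * insn->label;

 label_add:
  /* Three extra loads from the VM program just to find the operands. */
  regs[insn[3].i] = regs[insn[1].i] + regs[insn[2].i];
  insn += 4;  /* skip code ptr. and three residual arguments */
  goto * insn->label;

 label_add_r0_r1_r1:
  regs[1] = regs[0] + regs[1];
  insn ++;    /* skip code ptr. only */
  goto * insn->label;

 label_exit:
  return regs[1];
}
```

Both paths return the same sum. Note that this demo keeps the unspecialized label_add in the same function only for comparison; a specialized interpreter contains no regs access with a non-constant index at all, which is what allows GCC to keep VM registers in hardware registers.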
More on specialization: slow VM registers

There's a limit to the number of VM registers we can use for
generating specialized instructions. However, for convenience and
expressiveness, we can also, optionally, provide an unlimited number
of additional VM registers, less efficient to access.

We call the VM registers on which we specialize fast registers, and
the others slow registers. Slow registers are implemented as a
(separate) array in hardware memory, exactly like pre-specialization
VM registers, pointed to by slow_regs.

The distinction between fast and slow registers is transparent:

A VM instruction specification from the “Uninspired” VM (edited)

    instruction add (?R, ?R, !R) # Each ‘R’ can be fast or slow
      code
        UNINSPIRED_ARGN2 = UNINSPIRED_ARGN0 + UNINSPIRED_ARGN1;
      end
    end
Slow VM registers: generated code expansion

The same VM instruction can indifferently use fast or slow VM
registers, or mix them together, according to each specialization:

(Macro-expanded) GNU C

    label_add_r0_rR_r0:
      regs[0] = regs[0] + (* (long *) (slow_regs + insn[1].i));
      insn += 2; // skip code ptr. and the residual slow_regs offset
      goto * insn->label;

The generator always encodes slow VM register arguments as
pre-shifted offsets from slow_regs within the VM program (here
insn[1].i).

Reading a VM slow register value still takes two inter-dependent
loads.
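A self-contained GNU C sketch of the mixed fast/slow case. The wrapper run_add_fast_slow, the union slot layout and the exit label are hypothetical; the instruction body follows the macro-expanded code on the slide, with the slow register encoded as a byte offset computed when the VM program is built.

```c
#include <stddef.h>

#define FAST_REGISTER_NO 2

/* A VM program slot: either a code pointer or a residual argument. */
union slot { void *label; long i; };

/* Add the slow register number slow_index to the fast register %r0,
   and return the new value of %r0. */
long
run_add_fast_slow (long r0, const long *slow_values, size_t slow_index)
{
  long regs[FAST_REGISTER_NO] = { r0, 0 };
  const char *slow_regs = (const char *) slow_values;

  union slot program[] =
    { { .label = &&label_add_r0_rR_r0 },
      /* Pre-shifted: a byte offset, not a register number. */
      { .i = (long) (slow_index * sizeof (long)) },
      { .label = &&label_exit } };
  const union slot *insn = program;
  goto * insn->label;

 label_add_r0_rR_r0:
  /* First load: the offset from the VM program.  Second load, which
     depends on the first: the register value from slow_regs. */
  regs[0] = regs[0] + (* (const long *) (slow_regs + insn[1].i));
  insn += 2;  /* skip code ptr. and the residual slow_regs offset */
  goto * insn->label;

 label_exit:
  return regs[0];
}
```

Storing the offset already multiplied by the register size saves a shift at run time, but the two dependent loads remain.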
More on specialization: literals

We can specialize on a set of particular instruction literal arguments
as well.

The same instruction can also be made to access either a register
or a literal at some position. For example adding 1 and -1 to a VM
register is presumably common:

VM instruction specification from the “Uninspired” VM

    instruction add (?Rn 1 -1, ?Rn 1 -1, !R)
      code
        UNINSPIRED_ARGN2 = UNINSPIRED_ARGN0 + UNINSPIRED_ARGN1;
      end
    end

Specialized literals are not held in the VM program (not
“residualized”).
More on specialization: literals performance

Now regs indices are constant, and literal constants are substituted
into the VM instruction code in C.

GNU C (Macro-expanded)

    label_addi_r0_n1_r0:
      regs[0] = regs[0] + 1;
      insn ++; // skip code ptr. only
      goto * insn->label;

compiled (x86_64)

    addq $8, %rax
    addq $1, %rbx
    jmpq *(%rax)   # Jump via memory

Good!

Here GCC emitted $1 as a hardware instruction immediate. This
code reads L1d only in the fallthru part.

[Always possible with small constants on most architectures]
VM operand access is now fast in the common case

[Little demo of the Uninspired VM, with direct-threaded
dispatching and specialization for fast operand access]
Replication

The next bottleneck

We have solved the problem of operand access in the common case.

The interpreter bottleneck has moved: now the problem is
dispatching.
• the fallthru code at the end of the typical VM instruction now
  takes longer than the part doing useful work.
• VM branches are less common than falling thru in real-world
  programs (the down-counter example is not representative)
• ...so let’s not think about VM branches yet
VM instruction replication

All VM instructions but unconditional branches end with slow
fallthru code. We want to remove it.

The solution is copying compiled specialized VM instruction code
sequences one after another, concatenating them into hardware
machine-code basic blocks. Then each VM instruction in the block
automatically “falls thru” into the next.

A code pointer is only needed at the beginning of each basic block.

I call this dispatching style minimal threading: it’s an optimization
of direct threading.
VM instruction replication example

[Figure: VM registers %r0 to %r4; the VM program pointed to by insn, with
the residual literal 7; and a replicated basic block holding, in order, the
compiled hardware machine code for !BEGINBASICBLOCK, sub/%r1/%r0/%r1,
add/%r0/nR/%r1 and b/lR, followed by the next compiled basic block.]

!BEGINBASICBLOCK only advances insn past the code pointer.

The sub/%r1/%r0/%r1 VM instruction doesn't touch L1d or even insn.

The add/%r0/nR/%r1 VM instruction has 7 as its residual literal argument, on
which it was not specialized.

Branch target arguments are not specialized for: the internal VM-program
pointer is b/lR's residual argument.
VM instruction replication challenges

Replicating code by itself is not hard [but see Bruno's point on slide 60]:
• allocate executable memory with mmap
• copy machine code for VM specialized instructions into the
  executable space, delimited by label-as-value pointers.

We have to call GCC with the right options to prevent disasters:
• PC-relative memory accesses or calls.
• non-PIC code
• at least -fno-reorder-blocks, -fpic mandatory

More subtly, GCC needs to keep its register allocation compatible
across the code for every VM specialized instruction.
• a few tricks: jumps (unreachable in replicated code) at the end
  of specialized instruction code, jumping to a C jump with a
  destination unknown to GCC (volatile, no-code inline asm
  with constraints).
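The two "not hard" steps, allocating with mmap and copying fragments one after another, can be sketched as follows. This is only an illustration: the function replicate and its parameters are hypothetical, the fragments are treated as opaque byte ranges (in a real VM each one would be delimited by two label-as-value pointers), and nothing here executes the copied bytes. Making the mapping executable with mprotect after writing is one possible choice; systems with stricter policies may refuse it.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE   /* for MAP_ANONYMOUS on some systems */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Copy fragment_no code fragments, back to back, into freshly mapped
   memory, then make the result executable.  Return the base address and
   store the total size in *total_size, or return NULL on failure. */
void *
replicate (const void *const fragments[], const size_t sizes[],
           size_t fragment_no, size_t *total_size)
{
  size_t total = 0;
  for (size_t i = 0; i < fragment_no; i ++)
    total += sizes[i];
  if (total == 0)
    return NULL;

  char *base = mmap (NULL, total, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED)
    return NULL;

  /* Concatenate: each fragment falls thru into the next one. */
  char *p = base;
  for (size_t i = 0; i < fragment_no; i ++)
    {
      memcpy (p, fragments[i], sizes[i]);
      p += sizes[i];
    }

  if (mprotect (base, total, PROT_READ | PROT_EXEC) != 0)
    {
      munmap (base, total);
      return NULL;
    }
  /* Make sure the instruction cache sees the new code. */
  __builtin___clear_cache (base, base + total);

  *total_size = total;
  return base;
}
```

The caller releases the block with munmap (base, total_size) when the VM program is destroyed.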
More VM instruction replication challenges

Global variable/function references are a problem (on most
architectures), but given their names in C the generator can define
macros to have them accessed thru a hidden stack-allocated
structure: convenient for C code snippets.

VM specification

    wrapped-globals
      printfixnum_format_string # String literals are dangerous!
    end

    wrapped-functions
      printf
      rand
      xmalloc
    end

Since we are already relying on another GCC extension when
replication is enabled, we can afford typeof as well in the generated
code, to free the user from the need to declare types.
Luca Saiu http://ageinghacker.net The art of the language VM, GNU Hackers’ Meeting 2017







Minimal threading 


Minimal threading is delicate but requires no assembly (unless
__builtin___clear_cache fails to invalidate L1i, as I saw happen
on powerpc).


Very portable: minimal threading is currently tested and working on
aarch64, alpha, arm, i386, mips, powerpc, s390, sparc, x86_64
(either endianness, either bitness), and it probably works on
many more architectures. It currently fails on sh4, which relies
heavily on PC-relative loads.


Minimal threading does require mmap, which isn't a problem on
GNU systems. [After my talk Bruno Haible taught me a technique I didn’t
know for working around the restrictions of W^X systems, using two mmap
mappings; still workable, but I admit that this will add some complexity.]


A good dispatching model for most architectures. Where not
supported (right now on sh4) the user can always revert to direct
threading, lower-performance but as portable as GCC.
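A minimal sketch of the memory handling that replication needs, assuming a POSIX system with mmap and mprotect. The function name replicate_code is hypothetical. Note that this sketch flips a single mapping from writable to executable with mprotect; it is not the two-mapping technique mentioned above, and it does not execute the copy.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* For MAP_ANONYMOUS on glibc. */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Copy length bytes of already compiled code into freshly mapped memory
   and make the copy executable.  Return NULL on failure, including the
   case of a zero length. */
void *
replicate_code (const void *code, size_t length)
{
  /* Map the memory writable but not executable at first. */
  void *copy = mmap (NULL, length, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (copy == MAP_FAILED)
    return NULL;
  memcpy (copy, code, length);

  /* Make it readable and executable once the bytes are in place. */
  if (mprotect (copy, length, PROT_READ | PROT_EXEC) != 0)
    {
      munmap (copy, length);
      return NULL;
    }

  /* Ask the compiler to invalidate the instruction cache for the range. */
  __builtin___clear_cache ((char *) copy, (char *) copy + length);
  return copy;
}
```

On a system where the builtin does not actually invalidate the instruction cache, this is the one place where a line of architecture-specific assembly would have to be added.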
Next bottleneck: VM branches 





With minimal threading we have mostly eliminated fallthru overhead
[we still need to increment insn for VM instructions with residual
arguments].


The next bottleneck to eliminate is VM branching; fallthru
overhead will also go away completely as a side effect.


• All VM branching overhead comes from the direct-threading
  convention of having VM program slots contain pointers to
  executable code.

• Moreover residual literals and slow register offsets are also
  loaded from the VM program in memory, which is usually
  suboptimal.


Why have the VM program as a data structure in memory at all?
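To make the two costs concrete, here is a minimal direct-threaded sketch of the down-counter in GNU C, using labels as values. The union word, the function run_countdown and the three-instruction program are hypothetical; only the name insn for the VM instruction pointer follows the text. Every dispatch loads a code pointer from the program, and the branch loads its residual argument from memory as well.

```c
/* A VM program slot: either a pointer to executable code or a residual
   argument. */
union word
{
  void *label;
  long literal;
};

/* Count down from n to zero and return the number of decrements.  The
   caller must pass n > 0. */
long
run_countdown (long n)
{
  union word program[] = {
    { .label = &&decrement },          /* Slot 0. */
    { .label = &&branch_if_nonzero },  /* Slot 1. */
    { .literal = 0 },                  /* Slot 2: residual, target slot. */
    { .label = &&halt }                /* Slot 3. */
  };
  const union word *insn = program;    /* The VM instruction pointer. */
  long steps = 0;

  goto *insn->label;

decrement:
  n--;
  steps++;
  insn++;                              /* Fallthru overhead... */
  goto *insn->label;                   /* ...and an indirect jump. */

branch_if_nonzero:
  if (n != 0)
    insn = program + insn[1].literal;  /* Residual loaded from memory. */
  else
    insn += 2;                         /* Skip the opcode and its residual. */
  goto *insn->label;                   /* Indirect jump even when taken. */

halt:
  return steps;
}
```

A call such as run_countdown (5) performs five decrements, and pays for a memory load plus an indirect jump at every one of its dispatches.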
No-threading dispatch and residual access


Introducing the last and most efficient dispatching mode, no
threading.


The idea: do away with the VM program as a data structure, and
only keep the replicated executable code.


At this point we need some architecture-specific assembly code:
• Residual literals must be materialized into hardware registers
  or memory, since there is no program to load them from
• Small hand-written assembly routines, to be patched with
  literals...
• ...copied before the beginning of each VM specialized
  instruction code needing residuals.
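A minimal sketch of patching a literal into such a routine, assuming x86_64 and its "movabs $imm64, %rax" encoding (REX.W prefix 0x48, opcode 0xb8, eight immediate bytes). The choice of %rax, the template array and the function emit_load_literal are hypothetical; the sketch only writes bytes into a buffer and never executes them.

```c
#include <stdint.h>

/* Routine template: the two opcode bytes, then an eight-byte hole left
   as zeroes where the literal goes. */
static const uint8_t load_literal_template[10] = { 0x48, 0xb8 };
enum { template_size = 10, literal_offset = 2, literal_size = 8 };

/* Copy the routine to destination and patch the literal into its hole.
   Return the address right after the routine, where the code of the VM
   specialized instruction needing the residual would be copied next. */
uint8_t *
emit_load_literal (uint8_t *destination, uint64_t literal)
{
  int i;
  for (i = 0; i < template_size; i++)
    destination[i] = load_literal_template[i];
  /* x86_64 immediates are little-endian: least significant byte first. */
  for (i = 0; i < literal_size; i++)
    destination[literal_offset + i] = (uint8_t) (literal >> (8 * i));
  return destination + template_size;
}
```

On a RISC with fixed-width instructions the same idea would need a sequence of several instructions, each carrying a slice of the literal, which is why these routines are architecture-specific.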
No-threading branches


Without the VM program there is no longer need for insn either,
not even in a hardware register.


• The VM instruction pointer is the same as the hardware
  instruction pointer (%rip on x86_64): native hardware
  branches!

• Branching via a hardware register is easy in GNU C and
  requires no assembly: goto *...

• ... but that would be suboptimal:
  jmp L is usually faster than jmpq *%rax.
No-threading dispatch: label arguments


Label literals, as wide constants, are painful to load on RISCs and
also force the CPU to jump thru a register or memory.

• We want to replace jumps in C code snippets with the
  appropriate hardware machine instructions, also in the
  conditional case.

• Difficult, as jumps may occur anywhere within compiled C
  code.

• Solution: provide predefined macros VMPREFIX_BRANCH_FAST,
  VMPREFIX_BRANCH_FAST_IF_LESS_THAN,
  VMPREFIX_BRANCH_AND_LINK_FAST, ...
  expanding to patch-ins.
Patch-ins


Every patch-in use generates a sequence of 0x0s in compiled
code, of the right length for the missing hardware instruction(s) to
be patched in, and adds a pointer to the “hole” into a global table
in a different assembly section, along with an id for the specialized
instruction and the patch-in case (unconditional branch, branch-and-link,
branch-if-less-than-zero...).


(Macro-expanded) GNU C, simplified:

asm goto (".pushsection .data, 42\n"
          "  .quad hole_to_fill_%=\n"
          "  .quad " SPECIALIZED_INSTRUCTION_ID "\n"
          "  .quad " PATCH_IN_CASE "\n"
          ".popsection\n"
          "hole_to_fill_%=:\n"
          "  .skip " ROUTINE_LENGTH_IN_BYTES "\n"
          : /* outputs */
          : /* inputs... */
          : /* clobbers */
          : unreachable_label_jumping_where_gcc_cant_know);
Patch-ins in action





The assembly section containing the global table is scanned to 
compute the addresses to patch within replicated code. 


Jumps generated this way, and some inline asm for conditional 
branches, can make VM branches optimal on a given architecture. 


[Demo: disassembling and timing the down-counter under 
no-threading dispatch] 
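A minimal sketch of the scanning and patching step for the unconditional case, assuming x86_64 and its five-byte "jmp rel32" encoding (opcode 0xe9 followed by a little-endian 32-bit displacement). The table is a plain C array here rather than an assembly section, holes are stored as offsets from the beginning of the instruction code so that they apply to any replicated copy, and struct patch_in and patch_branches are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

enum patch_in_case { patch_in_case_branch };   /* Only one case here. */

/* One table entry: where the hole is, and what it is for. */
struct patch_in
{
  size_t hole_offset;
  int specialized_instruction_id;
  enum patch_in_case patch_in_case;
};

/* Fill every branch hole of the given specialized instruction, within
   its replicated copy, with a jump to target.  Return the number of
   holes patched, or -1 if target is out of rel32 range. */
int
patch_branches (uint8_t *copy, int id, const uint8_t *target,
                const struct patch_in *table, size_t entry_no)
{
  int patched = 0;
  size_t i;
  for (i = 0; i < entry_no; i++)
    {
      uint8_t *hole;
      intptr_t distance;
      uint32_t displacement;
      int k;
      if (table[i].specialized_instruction_id != id
          || table[i].patch_in_case != patch_in_case_branch)
        continue;
      hole = copy + table[i].hole_offset;
      /* The displacement is relative to the end of the 5-byte jump. */
      distance = (intptr_t) target - (intptr_t) (hole + 5);
      if (distance < INT32_MIN || distance > INT32_MAX)
        return -1;
      displacement = (uint32_t) (int32_t) distance;
      hole[0] = 0xe9;
      for (k = 0; k < 4; k++)
        hole[1 + k] = (uint8_t) (displacement >> (8 * k));
      patched++;
    }
  return patched;
}
```

The other patch-in cases would differ only in the bytes written into the hole, for example a conditional jump encoding instead of 0xe9.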
What should I call this?


Am I still speaking of efficient interpreters, or have I already crossed
into JIT territory? The answer may be blurry, particularly with
respect to common public expectations.


I will avoid the question, and call the software a generator of
efficient “virtual machines”.


My VM generator is called Jitter, and a VM generated by Jitter will
be “Jittery”. You are free to follow your imagination in interpreting
the name. Here are some possibilities:

• a piece of software attempting to pass for a JIT, without success
• a maker of JITs
• something shaky and unreliable
The near future 





I'm releasing Jitter’s code right now, for the first time.


http://ageinghacker.net/ghm-2017


There are rough edges but the code is not terrible. If you like
languages you'll have fun.


• I want to propose Jitter as a GNU project.

• Implementation-wise, rewrite rules are the most urgent thing.
  [I also have to actually use the Array; that’s easy and will be ready soon,
  possibly before the GHM is over. Hierarchical wrapped globals will have
  to wait a little.]

• I have to finish the manual. Of the already existing part I
  strongly recommend the section about when not to use VMs
  in the introduction.
Thank you


Also thanks to the people from whose work I learned the bases on
which I built Jitter, particularly Anton Ertl. See the bibliography
below, and the NOTES file in the tarball.


My virtual machine is faster
than yours.


Any questions?


Are you thinking of some application for Jitter? Tell me.


Bibliography


Ertl, M. A. (2008). The Vmgen manual. The manual is in
Texinfo, distributed along with GForth. Do a M-x info vmgen
if you use the Emacs Info reader.


Ertl, M. A. and Gregg, D. (2004). Retargeting JIT compilers by
using C-compiler generated executable code. In Proceedings of
the 13th International Conference on Parallel Architectures and
Compilation Techniques, PACT '04, pages 41-50, Washington,
DC, USA. IEEE Computer Society.


Ertl, M. A., Gregg, D., Krall, A., and Paysan, B. (2002).
Vmgen: a generator of efficient virtual machine interpreters.
Software: Practice and Experience, 32, 2002.







Saiu, L. (2017). The Jitter NOTES file. The NOTES file in the 
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more references. Not really a literature review, but at least a 
list of useful pointers to scientific publications. 
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conclusions. 
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